How complete is the CDC's COVID-19 case surveillance data for race/ethnicity at the state and county levels?

February 9, 2021

In [ ]:
#@title
import pandas as pd
import altair as alt
from vega_datasets import data

from google.colab import auth
auth.authenticate_user()

# Turn off the three-dot menu for Altair/Vega charts.
alt.renderers.set_embed_options(actions=False)
#%load_ext google.colab.data_table

# See the next cell for instructions on how to update the data.
In [ ]:
#@title
# How to update the data:
# 0. You may need to copy this colab so you have your own version.
# 1. Update the cdc_table to have the latest data's suffix.
# 2. Update the date variables below to be the last case date included in the data.
#    If the CRDT doesn't have data on that exact date, choose the closest date for crdt_date.
# 3. If the last case date is after Feb 2, 2021, you'll need to upload a new version
#    of the crdt data to compare against and change the crdt_table name below.
# 4. Scatterplot max/min below in chart settings may need to be updated for more cases.
# 5. There are a few checks for the county_fips_mapping that we created due to issues with the CDC's.
#    Instructions are at https://docs.google.com/spreadsheets/d/1AVSSge7BpkbNL4PfumUZpL7hokMLjKUojtamQjNW6f0/edit?resourcekey=0-Abdprx3fy_pXikSCDV2hxw#gid=967935006.
# 6. Many/all of the tables and text are not auto-updated. If you want to do a full update of
#    the paper including text and tables, a lot of that is done in the commented-out PrintSummaryStats().

project_id = 'msm-secure-data-1b'
cdc_table = '`%s.ndunlap_secure.cdc_restricted_access_20210131`' % project_id
crdt_table = '`%s.ndunlap_secure.crdt_20210203`' % project_id
crew_table = '`msm-internal-data.crew.covid_case_surveillance`'
date = 'DATE(2021, 01, 16)'
crdt_date = '20210117'
date_display_name = 'Jan 16'

# Set the scatterplot max/min to better handle outliers (CA, Los Angeles).
total_cases_scale_max = 3500000
county_cases_scale_max = 1100000
county_cases_zoom_scale_max = 100000
cases_known_scale_max = 2200000 # known race/ethnicity

# Chart settings.
scatter_height = 300
scatter_width = 300
map_height = 300
map_width = 450
us_states = alt.topo_feature(data.us_10m.url, 'states')
us_counties = alt.topo_feature(data.us_10m.url+"#", 'counties')

territories = ('PR', 'GU', 'VI', 'MP', 'AS')
nyt_territories = ('Puerto Rico', 'Guam', 'Virgin Islands', 'Northern Mariana Islands', 'American Samoa')
# Territories use their current FIPS codes (60-78); 'NYC' maps to New York State.
states_to_fips = {'AL': 1, 'AK': 2, 'AZ': 4, 'AR': 5, 'CA': 6, 'CO': 8, 'CT': 9, 'DC': 11, 'DE': 10, 'FL': 12, 'GA': 13, 'HI': 15, 'ID': 16, 'IL': 17, 'IN': 18, 'IA': 19, 'KS': 20, 'KY': 21, 'LA': 22, 'ME': 23, 'MD': 24, 'MA': 25, 'MI': 26, 'MN': 27, 'MS': 28, 'MO': 29, 'MT': 30, 'NE': 31, 'NV': 32, 'NH': 33, 'NJ': 34, 'NM': 35, 'NY': 36, 'NYC': 36, 'NC': 37, 'ND': 38, 'OH': 39, 'OK': 40, 'OR': 41, 'PA': 42, 'RI': 44, 'SC': 45, 'SD': 46, 'TN': 47, 'TX': 48, 'UT': 49, 'VT': 50, 'VA': 51, 'WA': 53, 'WV': 54, 'WI': 55, 'WY': 56, 'AS': 60, 'GU': 66, 'MP': 69, 'PR': 72, 'VI': 78}
race_ethnicity_combined_map = {
    'Asian, Non-Hispanic': 'asian_cases',
    'Black, Non-Hispanic': 'black_cases',
    'White, Non-Hispanic': 'white_cases',
    'American Indian/Alaska Native, Non-Hispanic': 'aian_cases',
    'Hispanic/Latino': 'hispanic_cases',
    'Multiple/Other, Non-Hispanic': 'other_cases',
    'Native Hawaiian/Other Pacific Islander, Non-Hispanic': 'nhpi_cases',
    'Missing': 'unknown_cases',
    'Unknown': 'unknown_cases',
    'NA': 'na_cases',
}
In [ ]:
#@title
crdt_query = ('''
SELECT
  State as state,
  Cases_Total as crdt_cases,
  Cases_Total - Cases_Unknown as crdt_known_race_cases,
  ROUND(1 - Cases_Unknown / Cases_Total, 4) as crdt_known_race_cases_percent,
FROM %s
WHERE
  date = %s
''' % (crdt_table, crdt_date))

nyt_states_query = ('''
SELECT
  state_name,
  state_fips_code,
  confirmed_cases as nyt_cases,
  deaths as nyt_deaths
FROM `bigquery-public-data.covid19_nyt.us_states`
WHERE
  date = %s AND
  state_fips_code IS NOT NULL
''' % date)

nyt_counties_query = ('''
SELECT
  county_fips_code,
  confirmed_cases as nyt_cases,
FROM `bigquery-public-data.covid19_nyt.us_counties`
WHERE
  date = %s AND
  county_fips_code IS NOT NULL
''' % date)

cdc_states_query = ('''
SELECT
  res_state,
  COUNT(*) as cdc_cases
FROM
  %s
GROUP BY
   res_state
''' % cdc_table)

cdc_counties_query = ('''
SELECT
  res_state,
  res_county,
  race_ethnicity_combined,
  COUNT(*) as cases
FROM
  %s
GROUP BY
   res_county,
   res_state,
   race_ethnicity_combined
''' % cdc_table)

compare_cases_unknowns_query = ('''
SELECT
  res_state,
  race_ethnicity_combined,
  COUNT(*) as cdc_cases
FROM
  %s
GROUP BY
   res_state,
   race_ethnicity_combined
''' % cdc_table)

cdc_states_by_month_query = ('''
SELECT
  res_state,
  CONCAT(EXTRACT(YEAR from cdc_case_earliest_dt), '-Q', EXTRACT(QUARTER from cdc_case_earliest_dt)) as date,
  COUNT(*) as total_cases,
FROM
  %s
WHERE
  cdc_case_earliest_dt >= DATE(2020, 1, 1) AND
  cdc_case_earliest_dt < DATE(2021, 1, 1) AND
  res_state in ('AK', 'CA', 'CT', 'DE', 'GA', 'LA', 'MD', 'ND', 'NY', 'PA', 'RI')
GROUP BY
   1, 2
ORDER BY
   1, 2
''' % cdc_table)

cdc_states_by_month_known_or_na_query = ('''
SELECT
  res_state,
  CONCAT(EXTRACT(YEAR from cdc_case_earliest_dt), '-Q', EXTRACT(QUARTER from cdc_case_earliest_dt)) as date,
  COUNT(*) as known_or_na_cases,
FROM
  %s
WHERE
  cdc_case_earliest_dt >= DATE(2020, 1, 1) AND
  cdc_case_earliest_dt < DATE(2021, 1, 1) AND
  race_ethnicity_combined != 'Unknown' AND
  race_ethnicity_combined != 'Missing'
GROUP BY
   1, 2
ORDER BY
   1, 2
''' % cdc_table)

cdc_overall_query = ('''
SELECT
  race_ethnicity_combined,
  COUNT(*) as cases
FROM
  %s
GROUP BY
   1
''' % cdc_table)

county_fips_mapping_query = ('''
SELECT
*
FROM
  `msm-secure-data-1b.ndunlap_secure.county_fips_mapping`
''')

acs_population_data_query = ('''
SELECT
  state,
  county,
  county_fips,
  total_pop
FROM
  `msm-internal-data.ipums_acs.acs_2019_5year_county`
''')
In [ ]:
#@title
def FieldAnalysis(project_id, table, field_list):
  # One column per field; named to avoid shadowing the built-in `dict`.
  unknown_shares = {field: [0.0, 0.0, 0.0, 0.0] for field in field_list}
  unknowns = pd.DataFrame(unknown_shares, index=['Unknown', 'Missing', 'NA', 'Known'])
  field_series = []
  value_series = []
  percent_series = []

  for field in field_list:
    field_unknowns_query = ('''
    SELECT
      %s,
      count(*) as cases
    FROM
      %s
    GROUP BY
      %s
    ''')
    query = field_unknowns_query % (field, table, field)
    field_unknowns_df = pd.io.gbq.read_gbq(query, project_id=project_id)
    field_unknowns_df.set_index(field, inplace=True)
    field_unknowns_df.index = field_unknowns_df.index.fillna('Null')

    field_display_name = {
        'cdc_case_earliest_dt': 'CDC earliest case date',
        'current_status': 'Case status',
        'res_state': 'State',
        'res_county': 'County',
        'sex': 'Sex',
        'age_group': 'Age',
        'race_ethnicity_combined': 'Race/Ethnicity'}

    missing_count = 0
    if 'Missing' in field_unknowns_df.index:
      missing_count += field_unknowns_df.loc['Missing'].cases
    if 'Null' in field_unknowns_df.index:
      missing_count += field_unknowns_df.loc['Null'].cases
    if '' in field_unknowns_df.index:
      missing_count += field_unknowns_df.loc[''].cases
    unknowns.loc['Missing', field] = missing_count / field_unknowns_df.cases.sum()

    if 'Unknown' in field_unknowns_df.index:
      unknowns.loc['Unknown', field] = field_unknowns_df.loc['Unknown'].cases / field_unknowns_df.cases.sum()
    if 'NA' in field_unknowns_df.index:
      unknowns.loc['NA', field] = field_unknowns_df.loc['NA'].cases / field_unknowns_df.cases.sum()
    unknowns.loc['Known', field] = 1 - (unknowns.loc['Missing', field] +
                                        unknowns.loc['Unknown', field] +
                                        unknowns.loc['NA', field])
    field_series.extend([field_display_name.get(field, field)] * 4)
    value_series.extend(['Known', 'Suppressed', 'Unknown', 'Missing'])
    percent_series.extend([unknowns.loc['Known', field],
                           unknowns.loc['NA', field],
                           unknowns.loc['Unknown', field],
                           unknowns.loc['Missing', field]])
  test = pd.DataFrame.from_dict({'field': field_series,
                               'value': value_series,
                               'percent': percent_series})
  return alt.Chart(test).mark_bar().encode(
      x=alt.X('percent', axis=alt.Axis(format='%'), title=''),
      y=alt.Y('field', sort='x', title='Field'),
      color=alt.Color('value', scale=alt.Scale(scheme='category20'), title='Value'),
      order=alt.Order('field:N'),
      tooltip=[
                  alt.Tooltip('field:N', title='Field'),
                  alt.Tooltip('value:N', title='Value'),
                  alt.Tooltip('percent:Q', format=',.0%', title='Percent'),
      ]
  )

Background

The racial and ethnic disparities in the COVID-19 pandemic have exposed longstanding health inequities in the U.S., which have been described in multiple analyses of COVID-19 data by the Covid Tracking Project, New York Times, American Public Media Research Lab, and Kaiser Family Foundation among many others. Unfortunately, we still don't have a full understanding of these disparities because of the fragmented landscape of race/ethnicity data. On January 29, the Covid Tracking Project wrote, "the continued lack of either complete federal demographic data or federal guidelines for what states should publish make it impossible to fully understand who is being infected with and dying of COVID-19." Despite collecting the most comprehensive data on race/ethnicity, the Covid Tracking Project is still missing race data for a third of cases.

The most reliable and up-to-date data are scattered across state and local public health websites that use different standards and categories for reporting race/ethnicity. Collecting these data and turning them into a unified dataset has largely been left to non-governmental organizations like the Covid Tracking Project, which announced that it will stop collecting data on March 7, 2021, a full year after it started. Until now, even the federal government has looked to the Covid Tracking Project for reliable COVID-19 race/ethnicity data. The office of the Assistant Secretary for Planning and Evaluation, an agency within the U.S. Department of Health and Human Services, wrote in October 2020, "The volunteer-based COVID tracking project has created the most comprehensive centralized resource for race and ethnicity data at the state level."

The outlook for race/ethnicity data on cases is even bleaker at the county level. The CDC now shows total case counts at the county level in a dashboard. Before the CDC published that data, several non-governmental organizations (New York Times, Johns Hopkins University, USAFacts) took it upon themselves to gather data for total case counts at the county level. But none of these sources collect or publish race/ethnicity data, which would have been a huge undertaking due to the non-standard reporting of race/ethnicity across state and local public health websites. The only public analysis of case data with race/ethnicity at the county level was in July 2020 when the New York Times published The Fullest Look Yet at the Racial Inequity of Coronavirus. The New York Times used CDC case surveillance data that they obtained via FOIA and legal action to do a one-time analysis of cases up to May 28, 2020.

After the Covid Tracking Project stops collecting data on March 7, 2021, how will we be able to track the disproportionate impact of COVID-19 on communities of color in the U.S.? Will we be able to get race/ethnicity breakdowns at the state level? What about at the county level? There are only two options for public COVID-19 case data with race/ethnicity as a unified dataset across the U.S.: The Covid Tracking Project, which is based on state public health websites and will stop data collection in March 2021, and the CDC's case surveillance data, which is based on state and local health departments reporting cases to the CDC. There are more options for data on deaths, which we discuss in a separate deaths data report.

The CDC publishes race/ethnicity case data at the U.S. level in a dashboard that's updated daily and publishes the underlying public dataset in a separate dashboard that's updated monthly. In November 2020, the CDC made the same underlying dataset available with state and county information, but with restricted access subject to a data use agreement. The CDC's initial restricted access data agreement did not allow for county-level analyses to be made public, but an updated data agreement from Dec 14, 2020 allowed such public analyses. In January 2021, the Morehouse School of Medicine's Satcher Health Leadership Institute (MSM/SHLI) in collaboration with Google.org applied for and got access to this data within a few days.

The CDC case surveillance dataset has enormous potential: It could allow us to analyze data across all states and counties to study the disparities in COVID-19 cases using consistent race/ethnicity categories. This dataset could replace the Covid Tracking Project's dataset after they shut down operations in March 2021 and also enable the first analysis of race/ethnicity disparities at the county level since July 2020. The dataset also has age and sex for each case so that we could look at the intersection of race/ethnicity with age and sex. The dataset has case report dates, which would allow us to look at changes in the data over time. However, the dataset will only live up to its potential if it is complete both in terms of the number of cases included and the number of cases that have race/ethnicity.

Unfortunately, the dataset has significant completeness issues:

  • Only 78% of total cases in the Covid Tracking Project up to January 16 are included (5.2M cases are missing)
  • Of the cases in the dataset, only 55% have known race/ethnicity (8.3M out of 18.4M cases are missing race/ethnicity)

For the 10.1M cases where we do know race/ethnicity, we can see the following disparities across race/ethnicity groups:

In [ ]:
#@title
overall_df = pd.io.gbq.read_gbq(cdc_overall_query, project_id=project_id)
overall_df['race_ethnicity_combined'] = overall_df.race_ethnicity_combined.astype('string').str.strip()
overall_df = overall_df.replace(to_replace={'race_ethnicity_combined': race_ethnicity_combined_map})
overall_df = overall_df.set_index('race_ethnicity_combined')

chart_denominator = 1000000
cases_list = [overall_df.cases['hispanic_cases'] / chart_denominator,
         overall_df.cases['black_cases'] / chart_denominator,
         overall_df.cases['white_cases'] / chart_denominator,
         overall_df.cases['asian_cases'] / chart_denominator,
         overall_df.cases['nhpi_cases'] / chart_denominator,
         overall_df.cases['aian_cases'] / chart_denominator,
         overall_df.cases.sum() / chart_denominator,
]

# Population data from https://api.census.gov/data/2019/acs/acs1/profile?get=NAME,DP05_0071E,DP05_0078E,DP05_0077E,DP05_0080E,DP05_0081E,DP05_0079E,DP05_0070E&for=us:1
pop_list = [
    60481746 / chart_denominator,
    40596040  / chart_denominator,
    196789401 / chart_denominator,
    18427914  / chart_denominator,
    565473 / chart_denominator,
    2236348 / chart_denominator,
    328239523 / chart_denominator,
]
percent_list = []
for i in range(len(cases_list)):
  percent_list.append(cases_list[i] / pop_list[i])
prevalence = pd.DataFrame.from_dict({'group': [
    'Hispanic/Latino',
    'Black',
    'White',
    'Asian',
    'Native Hawaiian/Pacific Islander',
    'American Indian/Alaska Native',
    '*Total Including Unknowns*',
], 'percent': percent_list,
   'cases': cases_list,
   'population': pop_list,
})
bars = alt.Chart(prevalence).mark_bar().encode(
      x=alt.X('percent', axis=alt.Axis(format='.1%'), title=''),
      y=alt.Y('group', sort='-x', title=''),
      color=alt.Color('group', 
                      scale=alt.Scale(scheme='tableau20'),
                      title='',
                      legend=None),
      tooltip=[
                  alt.Tooltip('group:N', title='Race/Ethnicity Group'),
                  alt.Tooltip('percent:Q', format='.2%', title='Prevalence within group'),
                  alt.Tooltip('cases:Q', format=',.2f', title='Cases in group (millions)'),
                  alt.Tooltip('population:Q', format=',.2f', title='Population of group (millions)'),
      ]
).properties(
   title='Percent of Race/Ethnicity Group who had COVID-19 based on Incomplete CDC Data as of %s' % date_display_name
)

bars.display()
#alt.concat(bars).properties(
#    title=alt.TitleParams(
#        ['Source: U.S. Census Bureau\'s American Community Survey 2019 5-year estimates for population data.'],
#        baseline='bottom',
#        dy=20,
#        orient='bottom',
#        fontWeight='normal',
#        fontSize=11
#    )
#).display()

But the chart above is based on incomplete data. With only 78% of cases included, the total percent of people who had COVID-19 should be 7.2% instead of 5.6%. It's harder to estimate how much the individual race/ethnicity data are undercounting the true number of confirmed COVID-19 cases. For example, the CDC data say that 0% of cases in California were Hispanic/Latino people, whereas the California public health website reports that Hispanics/Latinos made up 55% of California cases (1.3M people) as of January 27.

If we added all 8.3M cases with missing race/ethnicity to the Hispanic/Latino group, the percent of Hispanic/Latinos in the U.S. who had COVID-19 would go from 3.2% to 16.9% — a 5x increase. If all 8.3M cases with missing race/ethnicity were Black people, the percent of Black people who had COVID-19 would go from 3.0% to 23.4% — an 8x increase. While these extreme scenarios are unlikely, they show us why missing race/ethnicity data is preventing us from truly understanding and addressing the disparities in the COVID-19 pandemic in the U.S.
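
The adjustments in the two paragraphs above are simple arithmetic. As a rough sketch, they can be reproduced as follows. The helper names are ours and purely illustrative, and the inputs are the rounded figures quoted in the text plus the population counts from the chart cell.

```python
# Illustrative arithmetic only. The helper names are hypothetical, and the
# inputs are the rounded figures quoted in the text.

def scale_for_missing_cases(observed_prevalence, share_of_cases_included):
    """Scale an observed prevalence up to account for cases absent from the dataset."""
    return observed_prevalence / share_of_cases_included


def prevalence_if_all_unknowns_in_group(observed_prevalence, population, unknown_cases):
    """Upper bound: assign every case with unknown race/ethnicity to one group."""
    known_cases = observed_prevalence * population
    return (known_cases + unknown_cases) / population


UNKNOWN_CASES = 8.3e6      # cases missing race/ethnicity
HISPANIC_POP = 60_481_746  # population counts used in the chart cell
BLACK_POP = 40_596_040

overall = scale_for_missing_cases(0.056, 0.78)
hispanic_max = prevalence_if_all_unknowns_in_group(0.032, HISPANIC_POP, UNKNOWN_CASES)
black_max = prevalence_if_all_unknowns_in_group(0.030, BLACK_POP, UNKNOWN_CASES)

print(f'Overall, scaled for missing cases: {overall:.1%}')
print(f'Hispanic/Latino upper bound: {hispanic_max:.1%}')
print(f'Black upper bound: {black_max:.1%}')
```

Because the inputs are rounded, the outputs only approximate what the full-precision data would give.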

At the same time, the data from state public health websites are not perfect; the Covid Tracking Project only has race/ethnicity data for 66% of cases. We'll examine which states and counties have data that are as reliable as state public health websites. If we can rely on the CDC's dataset for some states and counties, that could reduce the amount of manual data collection needed to replace the Covid Tracking Project's dataset.

Overview

The goal of this analysis is to assess the completeness of the CDC's Restricted Access dataset and its feasibility in examining disparities in race/ethnicity for COVID-19 cases at the state and county levels. We'll first give an overview of case data, aggregate data, and the tradeoffs between them. We will next compare the total case counts in the restricted access dataset to two reliable aggregate datasets at the state and county levels. We will then compare cases with race/ethnicity at the state level to the Covid Tracking Project's Covid Racial Data Tracker.

The top-level data completeness findings are:

  1. Data Overview: Most fields in the CDC's restricted access dataset are missing too many values to be useful. The only fields that are reliably filled in are dates of reporting and symptoms, case status (lab confirmed or probable), state, county, sex, age, and race/ethnicity. All other fields, including whether the person died or was hospitalized, are known for 50% or fewer of the cases. Race/ethnicity was only known for 55% of cases, as opposed to 97%-100% for all the other fields below.
In [ ]:
#@title
field_list = ['cdc_case_earliest_dt', 'current_status', 'res_state', 'res_county', 'sex', 'age_group', 'race_ethnicity_combined']
FieldAnalysis(project_id, cdc_table, field_list).display()
  2. Total Case Counts: The CDC dataset contains 78% of the total cases reported in the Covid Tracking Project (CTP). There's high variability at the state and county levels, where the state with the biggest discrepancy has only 3% of the total cases in the Covid Tracking Project. While it is expected that the CDC data will lag, a time lag alone can't explain the discrepancies in some states and counties.
  3. Cases with Race/Ethnicity: Race/ethnicity data is available for 55% of cases in the CDC dataset compared to 66% in the Covid Racial Data Tracker. Race/ethnicity data availability is highly variable across different states, which is common to both the CDC and Covid Racial Data Tracker datasets, but the Covid Racial Data Tracker has more cases with race/ethnicity information than the CDC in all but four states.

We used a composite measurement to evaluate the CDC dataset's completeness as compared to the Covid Racial Data Tracker (CRDT) at the state level. We calculated the percent of total cases that have race/ethnicity data and broke it down into its two separate components: the percentage of expected case counts included and the percentage of cases included with race/ethnicity. We looked at the number of states that had at least 50% of total case counts with race/ethnicity and those that had at least 85%.
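
A minimal sketch of the composite measure, assuming it is the product of its two components. That assumption matches the national figures in this report (78% of expected cases times 55% with race/ethnicity gives about 43%). The function name is hypothetical.

```python
# Sketch of the composite completeness measure. The function name is
# hypothetical, and the product form is an assumption that matches the
# national figures reported here.

def composite_completeness(share_of_expected_cases, share_with_race_ethnicity):
    """Share of all expected cases that are both included and have race/ethnicity."""
    return share_of_expected_cases * share_with_race_ethnicity


cdc_composite = composite_completeness(0.78, 0.55)
crdt_composite = composite_completeness(1.00, 0.66)  # CRDT is the baseline for case counts

print(f'CDC composite: {cdc_composite:.0%}')
print(f'CRDT composite: {crdt_composite:.0%}')
```

The same function can be applied per state to count how many states clear the 50% and 85% thresholds.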

In [ ]:
#@title
# Manually update these fields based on chart above, latest CDC data,
# and improving state/county data below.

row_names = [
    '% of CTP case count',
    '% with race/ethnicity',
    'Composite % of total with race/ethnicity',
    'Number of states with composite > 85%',
    '(as a percent of all states)',
    'Number of states with composite > 50%',
    '(as a percent of all states)',
]
crdt_metadata = [
    '100%', 
    '66%',
    '66%',
    '14',
    '(27%)',
    '49',
    '(96%)',
]
cdc_metadata = [
    '78%', 
    '55%',
    '43%',
    '2',
    '(4%)',
    '25',
    '(49%)',
]
table_data = {'CRDT': crdt_metadata, 'CDC': cdc_metadata}
metadata_df = pd.DataFrame(table_data, index=row_names)
metadata_df.head(15)
Out[ ]:
                                          CRDT   CDC
% of CTP case count                       100%   78%
% with race/ethnicity                     66%    55%
Composite % of total with race/ethnicity  66%    43%
Number of states with composite > 85%     14     2
(as a percent of all states)              (27%)  (4%)
Number of states with composite > 50%     49     25
(as a percent of all states)              (96%)  (49%)

Overall, the CDC dataset is less complete than the CRDT dataset, but some individual states do have CDC data as complete as the CRDT's. We do the same analysis at the county level as well, even though we can't compare it to any other datasets.

We also look at ways to improve the CDC's case surveillance data at the state and county levels. The CDC says they are "working with states to provide more information on race/ethnicity for reported cases. The percent of reported cases that include race/ethnicity data is increasing." Based on our analysis, most states and counties need to report race and ethnicity as something other than Unknown or Missing for more of their cases. However, a few states like California are missing entire swaths of cases, which points to larger issues within those states.

What we didn't include in this report:

Completeness Analysis

Data Overview

The case dataset comes from a case report form that is a dense, two-page form about each lab-confirmed or probable COVID-19 case (new version as of Jan 15, 2021). The restricted access dataset contains 32 fields, which are described on the CDC website. The public version of the restricted access dataset contains 12 of those fields. The CDC has extensive FAQs about the data, one of which is about completeness:

How complete are the data that the CDC receives about COVID-19 cases?

The COVID-19 pandemic has put unprecedented demands on the public health data supply chain. In many states, the large number of COVID-19 cases has severely strained the ability of hospitals, healthcare providers, and laboratories to report cases with complete demographic information, such as race and ethnicity. The unprecedented volume of cases has also limited the ability of state and local health departments to conduct thorough case investigations and collect all requested case data.

As a result, many COVID-19 case notifications submitted to CDC do not have complete information on patient demographics [...] Because it can be time-consuming for jurisdictions to collect the additional information, these data can lag behind the aggregate counts. Because of missing data, analyses of these data elements are likely an underestimate of the true occurrence.

The CDC distinguishes between aggregate data that comes from state and local public health websites vs. line- or case-level data that comes to the CDC from public health departments. The CDC FAQs say that aggregate data are more accurate than case data:

Aggregate counts provide the most up-to-date validated numbers on cases and deaths.

Because it can be time-consuming for jurisdictions to collect the additional information, these data can lag behind the aggregate counts. Although CDC receives this information for most cases, it does not receive it for all cases.

Aggregate data from public health websites often do contain race/ethnicity details, but state websites do not all use the same standard race/ethnicity categories. The Covid Racial Data Tracker captures the many non-standard ways in which different states report on race/ethnicity, where ethnicity is whether a person is Hispanic/Latino. Some states report race/ethnicity as a combined field where each race/ethnicity group is mutually exclusive, which is how the CDC case dataset reports this field. Other states report race/ethnicity as separate fields where Hispanic/Latino people are counted within different race groups as well as in a separate field for ethnicity. States can also differ in terms of which race categories they use, how they define them, whether multiracial people are counted multiple times in different categories, and what's included in the "Other" race category. For more details, see this Covid Racial Data Tracker analysis.
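
To make the difference between combined and separate reporting concrete, here is a small sketch that collapses separately reported race and ethnicity values into a single mutually exclusive category. The helper and its precedence rule (Hispanic/Latino ethnicity overrides race, and an unknown value on either side yields Unknown) are assumptions for illustration, not the CDC's documented procedure.

```python
# Hypothetical helper: collapse separately reported race and ethnicity into
# one mutually exclusive category. The precedence rule is an assumption made
# for illustration, not the CDC's documented procedure.

def combine_race_ethnicity(race, ethnicity):
    if ethnicity == 'Hispanic/Latino':
        return 'Hispanic/Latino'
    if race in (None, 'Unknown') or ethnicity in (None, 'Unknown'):
        return 'Unknown'
    return f'{race}, Non-Hispanic'


# Made-up example records: (race, ethnicity).
records = [
    ('Black', 'Hispanic/Latino'),
    ('Black', 'Non-Hispanic/Latino'),
    ('Asian', 'Unknown'),
]
for race, ethnicity in records:
    print(race, '|', ethnicity, '->', combine_race_ethnicity(race, ethnicity))
```

Under separate reporting, the first record would be counted both as Black and as Hispanic/Latino. Under combined reporting it falls into exactly one group.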

So, we will need to sacrifice the accuracy and timeliness of aggregate data to get standardized race/ethnicity reporting on cases across all states and counties. However, standardized reporting on race/ethnicity is only useful if we have that data in enough states and counties.

In the CDC case dataset, race/ethnicity is known for only 55% of cases. The cases without known race/ethnicity fall into the following categories:

  1. Marked as "Unknown" on the case report form (40%)
  2. Missing due to being left blank on the form (4%)
  3. Suppressed for privacy reasons for small geographic and/or demographic population groups (2%)

The CDC discussed the incompleteness of race/ethnicity data in their case data FAQs:

[...] in many states, the large number of COVID-19 cases has severely strained the ability to report cases with complete demographic information for race and ethnicity. With thousands of cases being reported, completeness of these elements is unlikely to improve in the immediate future for some jurisdictions.

Has this dataset gotten more complete since the New York Times obtained a copy of the case surveillance data in May 2020? Based on the comparison table below, the dataset now includes more counties and a higher percent of cases with both race/ethnicity and county; however, some of those differences may be due to the fact that there are simply more counties with COVID-19 cases in the more recent data. The percent of cases included in the dataset, measured against totals from the Covid Tracking Project (CTP), has declined from 88% to 78%.

In [ ]:
#@title
# Manually update these fields based on the latest CDC data.
# SELECT
# count(*) as count
# FROM `msm-secure-data-1b.ndunlap_secure.cdc_restricted_access_20210131`
# https://covidtracking.com/data/national
# County data calculated in Counties: CDC vs. NYT section.

row_names = [
    'Update frequency',
    'Latest cases date as of Feb 9, 2021',
    'Cases in dataset as of date',
    'Cases in CTP as of date',
    '(as a % of CTP)',
    'Number of counties',
    '(as a % of all counties)',
    'Population in those counties',
    '(as a % of total U.S population – States + D.C.)',
    'Cases with known race/ethnicity and county',
    '(as a % of cases in dataset)',
    'Access'
]
nyt_cdc_metadata = [
    'Once',
    'May 28, 2020',
    '1.4M',
    '1.7M',
    '(88%)',
    '974',
    '(31%)',
    '178M',
    '(~55%)',
    '~0.6M',
    '(44%)',
    'Not public'
]
cdc_metadata = [
    'Monthly', 
    'Jan 16, 2021',
    '18.4M',
    '23.6M',
    '(78%)',
    '3,061',
    '(97%)',
    '324M',
    '(99.8%)',
    '9.9M',
    '(54%)',
    'Restricted'
]
table_data = {'NYT/CDC': nyt_cdc_metadata, 'CDC': cdc_metadata}
metadata_df = pd.DataFrame(table_data, index=row_names)
metadata_df.head(15)
Out[ ]:
                                                  NYT/CDC       CDC
Update frequency                                  Once          Monthly
Latest cases date as of Feb 9, 2021               May 28, 2020  Jan 16, 2021
Cases in dataset as of date                       1.4M          18.4M
Cases in CTP as of date                           1.7M          23.6M
(as a % of CTP)                                   (88%)         (78%)
Number of counties                                974           3,061
(as a % of all counties)                          (31%)         (97%)
Population in those counties                      178M          324M
(as a % of total U.S population – States + D.C.)  (~55%)        (99.8%)
Cases with known race/ethnicity and county        ~0.6M         9.9M
(as a % of cases in dataset)                      (44%)         (54%)
Access                                            Not public    Restricted

Sources: NYT article and The Daily podcast episode about the article, CTP total case counts for the U.S. by date.

Total Case Counts

We will compare the CDC data against two sources of aggregate data: The CRDT and the NYT's public data, which are both updated on a regular basis (CRDT twice a week, NYT daily) and come from state and local public health websites. CRDT is the only source for case data with race/ethnicity breakdowns, but there are several sources for county-level aggregate case data in addition to the NYT, such as JHU and USAFacts (this paper analyzes the differences between those sources at the state level up to July for cases and deaths).

The table below compares geographic vs. race/ethnicity availability for these three different data sources:

  • NYT: New York Times COVID-19 Public Data
  • CRDT: Covid Racial Data Tracker Public Data
  • CDC: CDC Case Surveillance Restricted Access Data
In [ ]:
#@title
row_names = [
    'Total Cases — States',
    'Total Cases — Counties',
    'Cases by Race/Ethnicity — States',
    'Cases by Race/Ethnicity — Counties'
]
nyt_yn = [
    '✅',
    '✅',
    '❌',
    '❌',
]
crdt_yn = [
    '✅',
    '❌',
    '✅',
    '❌',
]
cdc_yn = [
    '✅',
    '✅',
    '✅',
    '✅',
]
table_data = {'NYT': nyt_yn, 'CRDT': crdt_yn, 'CDC': cdc_yn}
availability_df = pd.DataFrame(table_data, index=row_names)
availability_df.head()
Out[ ]:
                                    NYT  CRDT  CDC
Total Cases — States                ✅   ✅    ✅
Total Cases — Counties              ✅   ❌    ✅
Cases by Race/Ethnicity — States    ❌   ✅    ✅
Cases by Race/Ethnicity — Counties  ❌   ❌    ✅

Because the CDC is the only data source that has race/ethnicity at the county level, the most similar data for purposes of comparison are (1) CRDT data at the state level with race/ethnicity, and (2) NYT data at the county level with no race/ethnicity.

We will compare across these data sources up to Jan 16, 2021, which is the latest reporting date in the CDC data. We expect to see differences in the case counts due to lags in reporting the data, but we don't expect that time lags can explain large percentages of missing cases.
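
The comparison described above amounts to joining two sets of per-state counts and taking their ratio. A minimal pandas sketch, using made-up counts for two placeholder states, with column names and a cutoff that are illustrative assumptions rather than the values used in this analysis:

```python
import pandas as pd

# Made-up counts for two placeholder states; not real data.
cdc = pd.DataFrame({'state': ['AA', 'BB'], 'cdc_cases': [80_000, 3_000]})
reference = pd.DataFrame({'state': ['AA', 'BB'], 'reference_cases': [100_000, 100_000]})

# An outer join keeps states that appear in only one of the two sources.
compared = cdc.merge(reference, on='state', how='outer')
compared['cdc_share_of_reference'] = compared.cdc_cases / compared.reference_cases

# A reporting lag can plausibly explain a modest shortfall. A very low share
# suggests that whole batches of cases are missing.
LOW_SHARE_THRESHOLD = 0.5  # arbitrary cutoff, for illustration only
compared['large_gap'] = compared.cdc_share_of_reference < LOW_SHARE_THRESHOLD

print(compared)
```

A share near 1.0 means the two sources roughly agree. Shares far below 1.0 are the cases that a time lag alone is unlikely to explain.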

Baseline: NYT vs. CRDT

To get a baseline of how much we could expect the CDC case counts to match the CRDT or NYT, we can see how closely the CRDT and NYT match each other. Each dot below is a state (hover to see details), and the black line shows where the NYT and CRDT case counts are equal.

In [ ]:
#@title
def CreateScatterPlot(
    chart_df, fields_dict, title, scale_max, height, width, geo, metric_type):
  
  geo_field = 'state'
  geo_field_display_name = 'State'
  if geo == 'county':
    geo_field = 'state_county'
    geo_field_display_name = 'County'

  if metric_type == 'ratio':
    scale_scheme = 'blueorange'
    scale_reverse = True
    scale_domain = [0, 2]
    legend_format = '.1f'
    axis_format = ',.0f'
  elif metric_type == 'percent':
    scale_scheme = 'redyellowblue'
    scale_reverse = False
    scale_domain = [0, 1]
    legend_format = '.0%'
    axis_format = '.0%'

  tooltips = [alt.Tooltip(geo_field + ':N', title=geo_field_display_name)]
  for field in ('y', 'x', 'percent'):
    tooltips.append(alt.Tooltip(
        fields_dict[field]['name'] + ':Q',
        format=fields_dict[field]['format'],
        title=fields_dict[field]['title'],
    ))
  plot = alt.Chart(chart_df).mark_circle(size=60).encode(
      alt.X(fields_dict['x']['name'] + ':Q', axis=alt.Axis(title=fields_dict['x']['title'], format=axis_format),
          scale=alt.Scale(domain=(0, scale_max))
      ),
      alt.Y(fields_dict['y']['name'] + ':Q', axis=alt.Axis(title=fields_dict['y']['title'], format=axis_format),
          scale=alt.Scale(domain=(0, scale_max))
      ),
      color=alt.Color(fields_dict['percent']['name'],
                      type='quantitative',
                      scale=alt.Scale(scheme=scale_scheme,
                                      reverse=scale_reverse,
                                      domain=scale_domain,
                                      clamp=True),
                      legend=alt.Legend(format=legend_format),
                      title=metric_type.capitalize()),
      tooltip=tooltips,
  ).properties(
      height=height,
      width=width,
  )
  if metric_type == 'ratio':
    plot.interactive()

  line = pd.DataFrame({
      'x': [0, scale_max],
      'y': [0, scale_max],
  })

  if metric_type == 'ratio':
    line_plot = alt.Chart(line).mark_line(color='black').encode(
        x='x',
        y='y',
    )
  elif metric_type == 'percent':
    line_plot = (
        alt.Chart(pd.DataFrame({'x': [.5]})).mark_rule().encode(y='x') +
        alt.Chart(pd.DataFrame({'y': [.5]})).mark_rule().encode(x='y')
    )
  # Add interactive for concatenating due to https://github.com/altair-viz/altair/issues/2010.
  scatter = (plot + line_plot).properties(
      title=title,
      height=height,
      width=width,
  ).interactive()
  return scatter

def CreateMap(
    chart_df, fields_dict, title, scale_max, height, width, geo, metric_type):
  
  geo_field = 'state'
  geo_field_display_name = 'State'
  fips_code = 'state_fips_code'
  topo_feature = us_states
  if geo == 'county':
    geo_field = 'state_county'
    geo_field_display_name = 'County'
    fips_code = 'county_fips'
    topo_feature = us_counties

  if metric_type == 'ratio':
    scale_scheme = 'blueorange'
    scale_reverse = True
    scale_domain = [0, 2]
    legend_format = '.1f'
  elif metric_type == 'percent':
    scale_scheme = 'redyellowblue'
    scale_reverse = False
    scale_domain = [0, 1]
    legend_format = '.0%'

  highlight = alt.selection_single(on='mouseover', fields=['id', fips_code], empty='none')
  tooltips = [alt.Tooltip(geo_field + ':N', title=geo_field_display_name)]
  for field in ('y', 'x', 'percent'):
    tooltips.append(alt.Tooltip(
        fields_dict[field]['name'] + ':Q',
        format=fields_dict[field]['format'],
        title=fields_dict[field]['title'],
    ))

  field_names = [geo_field]
  field_names.extend([fields_dict[field]['name'] for field in fields_dict])
  plot = alt.Chart(topo_feature).mark_geoshape(
        stroke='white',
        strokeOpacity=.2,
        strokeWidth=1
    ).project(
      type='albersUsa'
    ).transform_lookup(
        lookup='id',
        from_=alt.LookupData(chart_df, fips_code, field_names)
    ).encode(
        alt.Color(fields_dict['percent']['name'],
                  type='quantitative',  
                  legend=alt.Legend(format=legend_format),
                  scale=alt.Scale(scheme=scale_scheme,
                                  reverse=scale_reverse,
                                  domain=scale_domain,
                                  clamp=True,
                                  ),
                  title=metric_type.capitalize()),
         tooltip=tooltips
    ).add_selection(
        highlight,
    )

  states_outline = alt.Chart(us_states).mark_geoshape(stroke='white', strokeWidth=1.5, fillOpacity=0, fill='white').project(
        type='albersUsa'
  )

  states_fill = alt.Chart(us_states).mark_geoshape(
        fill='silver',
        stroke='white'
  ).project('albersUsa')

  layered_map = alt.layer(states_fill, plot, states_outline).properties(
        height=height,
        width=width,
        title=title,
  )
  return layered_map

def CreateScatterPlotAndMap(
    chart_df, fields_dict, title, total_cases_scale_max, scatter_height, scatter_width, map_width, geo, metric_type):
  scatter = CreateScatterPlot(
    chart_df, fields_dict, title, total_cases_scale_max, scatter_height, scatter_width, geo, metric_type)
  # Named map_chart to avoid shadowing the built-in map().
  map_chart = CreateMap(
    chart_df, fields_dict, title, total_cases_scale_max, scatter_height, map_width, geo, metric_type)
  return (scatter | map_chart).configure_view(
       strokeWidth=0,
   ).configure_mark(
       stroke='grey'
   ).configure_legend(
       gradientLength=scatter_height - 50
   )

def PrintSummaryStats(chart_df, field='percent'):
  plus_minus_15_df = chart_df[chart_df[field] >= .85]
  plus_minus_15_df = plus_minus_15_df[plus_minus_15_df[field] <= 1.15]
  print('between +/-15%: ', len(plus_minus_15_df), round(len(plus_minus_15_df) / len(chart_df), 2))
  plus_minus_50_df = chart_df[chart_df[field] >= .50]
  plus_minus_50_df = plus_minus_50_df[plus_minus_50_df[field] <= 1.50]
  print('between +/-50%: ', len(plus_minus_50_df), round(len(plus_minus_50_df) / len(chart_df), 2))
  print('< than .50: ', len(chart_df[chart_df[field] < .5]))
  print('> than 1.50: ', len(chart_df[chart_df[field] > 1.5]))
  print(chart_df[field].describe())
In [ ]:
#@title
crdt_df = pd.io.gbq.read_gbq(crdt_query, project_id=project_id)
crdt_df.set_index('state', inplace=True)

nyt_states_df = pd.io.gbq.read_gbq(nyt_states_query, project_id=project_id)
nyt_states_df.state_fips_code.unique()
for territory in nyt_territories:
  nyt_states_df = nyt_states_df[nyt_states_df.state_name != territory]
nyt_states_df['state_fips_code'] = nyt_states_df.state_fips_code.astype(int)
nyt_states_df.set_index('state_fips_code', inplace=True)

crdt_df.reset_index(inplace=True)
crdt_df['state_fips_code'] = crdt_df.state
crdt_df = crdt_df.replace(to_replace={'state_fips_code': states_to_fips})
crdt_df.set_index('state_fips_code', inplace=True)
nyt_crdt_merged_df = nyt_states_df.join(crdt_df, on="state_fips_code", how='inner', lsuffix='_left', rsuffix='_right')

nyt_crdt_merged_df['percent'] = round(nyt_crdt_merged_df.nyt_cases / nyt_crdt_merged_df.crdt_cases, 2)
nyt_crdt_merged_df
nyt_crdt_merged_df.reset_index(inplace=True)

#PrintSummaryStats(nyt_crdt_merged_df)
In [ ]:
#@title
nyt_crdt_fields_dict = {
    'x': {'name': 'crdt_cases', 'format': ',', 'title': 'CRDT cases'},
    'y': {'name': 'nyt_cases', 'format': ',', 'title': 'NYT cases'},
    'percent': {'name': 'percent', 'format': '.2f', 'title': 'Ratio of NYT to CRDT'},
}
nyt_crdt_title = 'Ratio of NYT to CRDT Cases by State as of %s' % date_display_name

CreateScatterPlotAndMap(
    nyt_crdt_merged_df, nyt_crdt_fields_dict, nyt_crdt_title, total_cases_scale_max, scatter_height, scatter_width, map_width, 'state', 'ratio'
).display()

The ratio of NYT to CRDT cases is between 0.97 and 1.15 for all states:

  • Average = 1.00
  • Median = 1.00
  • Min = 0.97 (Tennessee)
  • Max = 1.15 (Georgia)
  • Percent between 0.85 and 1.15 = 100% (50 states + D.C. within +/- 0.15)

States: CDC vs. CRDT

We can see below that the CDC case counts differ from the CRDT case counts much more drastically than the NYT case counts did. We compared the CDC case data as of Jan 16 to the CRDT case data from Jan 17 because CRDT reports data only twice a week.

In [ ]:
#@title
cdc_states_df = pd.io.gbq.read_gbq(cdc_states_query, project_id=project_id)
cdc_states_df.rename(columns={'res_state': 'state'}, inplace=True)
cdc_states_df.set_index('state', inplace=True)

crdt_df = pd.io.gbq.read_gbq(crdt_query, project_id=project_id)

for territory in territories:
  crdt_df = crdt_df[crdt_df.state != territory]

crdt_df.set_index('state', inplace=True)
cdc_crdt_merged_df = cdc_states_df.join(crdt_df, on="state", how='inner', lsuffix='_left', rsuffix='_right')
cdc_crdt_merged_df.reset_index(inplace=True)
cdc_crdt_merged_df['state_fips_code'] = cdc_crdt_merged_df.state
cdc_crdt_merged_df = cdc_crdt_merged_df.replace(to_replace={'state_fips_code': states_to_fips})
cdc_crdt_merged_df['percent'] = round(cdc_crdt_merged_df.cdc_cases / cdc_crdt_merged_df.crdt_cases, 4)

#PrintSummaryStats(cdc_crdt_merged_df)
In [ ]:
#@title
cdc_crdt_fields_dict = {
    'x': {'name': 'crdt_cases', 'format': ',', 'title': 'CRDT cases'},
    'y': {'name': 'cdc_cases', 'format': ',', 'title': 'CDC cases'},
    'percent': {'name': 'percent', 'format': '.2f', 'title': 'Ratio of CDC to CRDT'},
}
cdc_crdt_title = 'Ratio of CDC to CRDT Cases by State as of %s' % date_display_name

CreateScatterPlotAndMap(
    cdc_crdt_merged_df, cdc_crdt_fields_dict, cdc_crdt_title, total_cases_scale_max, scatter_height, scatter_width, map_width, 'state', 'ratio'
).display()

Texas alone is missing 2M cases compared to the total case counts in the CRDT data.

The ratio of CDC to CRDT cases is between 0.03 and 1.86 for all states + D.C.:

  • Average = 0.76
  • Median = 0.87
  • Min = 0.03 (Texas, Wyoming)
  • Max = 1.86 (Alaska)
  • Percent between 0.85 and 1.15 = 51% (26 states within +/- 0.15)
  • Percent between 0.50 and 1.50 = 71% (36 states within +/- 0.50)

The 26 states that were within +/-15% of the CRDT data could plausibly be off due to time lags in reporting cases to the CDC vs. reporting them on state public health websites, but there are many outlier states that are too far off from the CRDT case counts to be explained by a time lag:

  • 14 states: < 0.50 ratio of CDC to CRDT cases (Texas, Wyoming, Louisiana, and West Virginia < 0.10)
  • 1 state > 1.50 ratio of CDC to CRDT cases (Alaska)
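
The bands used above can be assigned per state so that the outliers can be listed by name. This is a minimal sketch on toy data; the helper `ratio_band`, the state labels, and the counts are assumptions for illustration, and only the column names `cdc_cases`, `crdt_cases`, and `percent` follow the cells above.

```python
import pandas as pd

def ratio_band(ratio):
    """Assign a CDC-to-CRDT ratio to one of the bands used in the text."""
    if ratio < 0.50:
        return 'below 0.50'
    if ratio > 1.50:
        return 'above 1.50'
    if 0.85 <= ratio <= 1.15:
        return 'within +/- 0.15'
    return 'within +/- 0.50 only'

# Toy data (illustrative values only, not the notebook's results).
toy_df = pd.DataFrame({
    'state': ['AA', 'BB', 'CC', 'DD'],
    'cdc_cases': [30, 900, 1800, 600],
    'crdt_cases': [1000, 1000, 1000, 1000],
})
toy_df['percent'] = toy_df.cdc_cases / toy_df.crdt_cases
toy_df['band'] = toy_df.percent.apply(ratio_band)

# List the states in each band, e.g. to name the outliers.
states_by_band = toy_df.groupby('band')['state'].apply(list).to_dict()
print(states_by_band)
```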

Counties: CDC vs. NYT

In [ ]:
#@title
# CDC vs. NYT county

df = pd.io.gbq.read_gbq(cdc_counties_query, project_id=project_id)
for territory in territories:
  df = df[df.res_state != territory]

df_county_fips_map = pd.io.gbq.read_gbq(county_fips_mapping_query, project_id=project_id)

df_county_fips_map.cdc_county = df_county_fips_map.cdc_county.str.lower()
df_county_fips_map['state_county'] = df_county_fips_map.state + '-' + df_county_fips_map.cdc_county
df_county_fips_map['state_county'] = df_county_fips_map.state_county.astype('string').str.strip()
df_county_fips_map.set_index('state_county', inplace=True)
In [ ]:
#@title
# Concatenate the state and county names because county names are not unique across states.
df.res_county = df.res_county.str.lower()
df['state_county'] = df.res_state + '-' + df.res_county
df['state_county'] = df.state_county.astype('string').str.strip()
df.set_index('state_county', inplace=True)
df['race_ethnicity_combined'] = df.race_ethnicity_combined.astype('string').str.strip()
df = df.replace(to_replace={'race_ethnicity_combined': race_ethnicity_combined_map})
In [ ]:
#@title
# Printed value used in the footnotes below.
# These checks for county_fips_code mappings are now in
# https://docs.google.com/spreadsheets/d/1AVSSge7BpkbNL4PfumUZpL7hokMLjKUojtamQjNW6f0/edit?resourcekey=0-Abdprx3fy_pXikSCDV2hxw#gid=967935006
mismatches_df = df.join(df_county_fips_map, on="state_county", how='outer', lsuffix='_left', rsuffix='_right')
mismatches_df = mismatches_df[mismatches_df.county_fips.isna()]
mismatches_df = mismatches_df[mismatches_df.res_state != 'NA']
mismatches_df = mismatches_df[mismatches_df.res_state != 'Unknown']
mismatches_df = mismatches_df[mismatches_df.res_county != 'na']
mismatches_df = mismatches_df[mismatches_df.res_county != 'unknown']
mismatches_df = mismatches_df[mismatches_df.res_county != 'missing']
#print(mismatches_df.cases.sum())
In [ ]:
#@title
merged_df = df.join(df_county_fips_map, on="state_county", how='inner', lsuffix='_left', rsuffix='_right')

# Create a crosstab table with rows = counties, columns = race_ethnicity_combined.
crosstab_df = pd.crosstab(merged_df['county_fips'], merged_df.race_ethnicity_combined, values=merged_df.cases, aggfunc=sum,
                          margins=True,
                          margins_name='total_cases'
)
# Have to reset_index() to go from pandas multi-index to single index.
crosstab_df = crosstab_df.reset_index()
crosstab_df.drop(axis=0, index=len(crosstab_df) - 1, inplace=True)
crosstab_df['county_fips'] = crosstab_df.county_fips.astype(int)
crosstab_df['total_known_cases'] = crosstab_df['total_cases'] - crosstab_df.na_cases.fillna(0) - crosstab_df.unknown_cases.fillna(0)
In [ ]:
#@title
# Get the display names for each county.
# Use ACS data that only has one FIPS code per county unlike the fips_county_map.
df_acs_name_lookup = pd.io.gbq.read_gbq(acs_population_data_query, project_id=project_id)

df_acs_name_lookup['state_county'] = df_acs_name_lookup.county.astype('string').str.strip() + ', ' + df_acs_name_lookup.state.astype('string').str.strip()
df_acs_name_lookup.drop(columns=['state', 'county'], inplace=True)
df_acs_name_lookup.set_index('county_fips', inplace=True)

county_chart_df = crosstab_df.join(df_acs_name_lookup, on="county_fips", how='inner', lsuffix='_left', rsuffix='_right')
county_chart_df.county_fips = county_chart_df.county_fips.astype(int)

#print(len(county_chart_df))
#print(len(county_chart_df) / 3143)
#print(county_chart_df.total_pop.sum())
#print(county_chart_df.total_pop.sum() / 324697795)  # Population covered in these counties
#print(county_chart_df.total_known_cases.sum())
#print(0.55 * 324697795) # NYT population
In [ ]:
#@title

nyt_counties_df = pd.io.gbq.read_gbq(nyt_counties_query, project_id=project_id)
nyt_counties_df.rename(columns={'county_fips_code': 'county_fips'}, inplace=True)
nyt_counties_df.county_fips.unique()
nyt_counties_df['county_fips'] = nyt_counties_df.county_fips.astype(int)
nyt_counties_df.set_index('county_fips', inplace=True)

county_chart_df.set_index('county_fips', inplace=True)
nyt_merged_df = county_chart_df.join(nyt_counties_df, on="county_fips", how='left', lsuffix='_left', rsuffix='_right')
nyt_merged_df = nyt_merged_df.reset_index()
nyt_merged_df['percent'] = round(nyt_merged_df.total_cases / nyt_merged_df.nyt_cases, 2)

#PrintSummaryStats(nyt_merged_df)

We can do the same analysis at the county level using the CDC vs. NYT data.

Each dot is a county (hover to see details). We show all 3,054 counties in the CDC data that were also in the NYT data on the left and zoom in on the smaller counties on the right. Note that the five counties in New York City are missing because the NYT combined them into one region.

In [ ]:
#@title
cdc_nyt_fields_dict = {
    'x': {'name': 'nyt_cases', 'format': ',', 'title': 'NYT cases'},
    'y': {'name': 'total_cases', 'format': ',', 'title': 'CDC cases'},
    'percent': {'name': 'percent', 'format': '.2f', 'title': 'Ratio of CDC to NYT'},
}
cdc_nyt_title = 'Ratio of CDC Cases to NYT Cases by County as of %s' % date_display_name
zoom_cdc_nyt_title = 'Zoom in on counties with up to 100,000 Cases'

scatter = CreateScatterPlot(
    nyt_merged_df, cdc_nyt_fields_dict, cdc_nyt_title, county_cases_scale_max, scatter_height, scatter_width, 'county', 'ratio'
)
zoom_scatter = CreateScatterPlot(
    nyt_merged_df, cdc_nyt_fields_dict, zoom_cdc_nyt_title, county_cases_zoom_scale_max, scatter_height, scatter_width, 'county', 'ratio'
)

(scatter | zoom_scatter).configure_view(
    strokeWidth=0,
).configure_legend(
    gradientLength=map_height - 50
).configure_mark(
    stroke='grey'
).display()

Harris County, Texas is missing 280K cases compared to the total case counts in the NYT data.

The ratio of CDC to NYT cases is between 0.00 and 9.80 for the 3,054 counties in the CDC data that were also in the NYT data:

  • Average = 0.71
  • Median = 0.83
  • Min = 0.00
  • Max = 12.6 (Lake and Peninsula Borough, Alaska)
  • Percent between 0.85 and 1.15 = 47% (1,447 counties within +/- 0.15)
  • Percent between 0.50 and 1.50 = 71% (2,163 counties within +/- 0.50)

We can also view these ratios on the map on the right and compare them to the state-level totals map from the previous section on the left.

In [ ]:
#@title
cdc_nyt_fields_dict = {
    'x': {'name': 'nyt_cases', 'format': ',', 'title': 'NYT cases'},
    'y': {'name': 'total_cases', 'format': ',', 'title': 'CDC cases'},
    'percent': {'name': 'percent', 'format': '.2f', 'title': 'Ratio of CDC to NYT'},
}
cdc_nyt_title = 'Ratio of CDC Cases to NYT Cases by County as of %s' % date_display_name

cdc_nyt_map = CreateMap(
    nyt_merged_df, cdc_nyt_fields_dict, cdc_nyt_title, total_cases_scale_max, map_height, map_width, 'county', 'ratio'
)
cdc_crdt_map = CreateMap(
    cdc_crdt_merged_df, cdc_crdt_fields_dict, cdc_crdt_title, total_cases_scale_max, map_height, map_width, 'state', 'ratio'
)

(cdc_crdt_map | cdc_nyt_map).configure_view(
    strokeWidth=0,
).configure_legend(
    gradientLength=map_height - 50
).display()

Notes:

  • The legend only goes to 2.0, and all counties with a larger ratio are shown in the same dark blue color.
  • A larger version of the county map for hovering over smaller counties is available in the Appendix.

We can see that the ratio of the CDC case data to CRDT/NYT aggregate data is highly variable across the U.S., but there is less variability across the counties within each state. This pattern suggests that the data completeness issues may stem from state-level policies or data collection processes rather than county-level ones. We can also see that some counties are missing entirely from the data, e.g., in Texas, Wyoming, West Virginia, and Nebraska. It's possible that some of these counties have cases in the data but the county name was suppressed for privacy reasons due to small population sizes. Even so, those cases would still have a state name, so they would be captured in the map on the left above.
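
The "more variable across states than within them" observation is made visually here, but it could be checked numerically along the following lines. This is a sketch under assumptions: the frame `toy_counties`, its `state` and `ratio` columns, and the values are all made up for illustration and do not come from the notebook's data.

```python
import pandas as pd

# Toy county-level ratios of CDC to NYT cases (illustrative values only).
toy_counties = pd.DataFrame({
    'state': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
    'ratio': [0.90, 1.00, 1.10, 0.05, 0.10, 0.15],
})

# Mean and standard deviation of the county ratios within each state.
per_state = toy_counties.groupby('state')['ratio'].agg(['mean', 'std'])

# Typical spread of counties around their own state's mean.
within_state_spread = per_state['std'].mean()
# Spread of the state means around each other.
between_state_spread = per_state['mean'].std()

print(per_state)
print('within-state spread: %.3f' % within_state_spread)
print('between-state spread: %.3f' % between_state_spread)
```

A between-state spread that is much larger than the within-state spread would be consistent with state-level causes, although it would not prove them.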

Cases with Race/Ethnicity

States and Counties: CDC

In [ ]:
#@title
states_df = pd.io.gbq.read_gbq(compare_cases_unknowns_query, project_id=project_id)
for state in ('Unknown', 'NA', 'OCONUS'):
  states_df = states_df[states_df.res_state != state]

states_df['race_ethnicity_combined'] = states_df.race_ethnicity_combined.astype('string').str.strip()
states_df = states_df.replace(to_replace={'race_ethnicity_combined': {
    'Asian, Non-Hispanic': 'cdc_known_cases',
    'Black, Non-Hispanic': 'cdc_known_cases',
    'White, Non-Hispanic': 'cdc_known_cases',
    'American Indian/Alaska Native, Non-Hispanic': 'cdc_known_cases',
    'Hispanic/Latino': 'cdc_known_cases',
    'Multiple/Other, Non-Hispanic': 'cdc_known_cases',
    'Native Hawaiian/Other Pacific Islander, Non-Hispanic': 'cdc_known_cases',
    'Missing': 'cdc_unknown_cases',
    'Unknown': 'cdc_unknown_cases',
    'NA': 'cdc_na_cases',
    }})
states_df.rename(columns={'res_state': 'state'}, inplace=True)
In [ ]:
#@title
crosstab_df = pd.crosstab(states_df['state'], states_df.race_ethnicity_combined, values=states_df.cdc_cases, aggfunc=sum,
                          margins=True,
                          margins_name='cdc_cases'
)
# Have to reset_index() to go from pandas multi-index to single index.
crosstab_df = crosstab_df.reset_index()
crosstab_df.drop(axis=0, index=len(crosstab_df) - 1, inplace=True)
crosstab_df['cdc_known_or_na_cases'] = crosstab_df['cdc_cases'] - crosstab_df.cdc_unknown_cases.fillna(0)
crosstab_df['cdc_known_cases'] = crosstab_df['cdc_cases'] - crosstab_df.cdc_na_cases.fillna(0) - crosstab_df.cdc_unknown_cases.fillna(0)
crosstab_df

crdt_merged_df = crosstab_df.join(crdt_df, on="state", how='inner', lsuffix='_left', rsuffix='_right')
crdt_merged_df.reset_index(inplace=True)
crdt_merged_df['state_fips_code'] = crdt_merged_df.state
crdt_merged_df = crdt_merged_df.replace(to_replace={'state_fips_code': states_to_fips})
crdt_merged_df['cdc_known_cases_percent'] = round(crdt_merged_df.cdc_known_cases / crdt_merged_df.cdc_cases, 4)
crdt_merged_df['cdc_known_or_na_cases_percent'] = round(crdt_merged_df.cdc_known_or_na_cases / crdt_merged_df.cdc_cases, 4)
crdt_merged_df['percent'] = round(crdt_merged_df.cdc_cases / crdt_merged_df.crdt_cases, 4)
crdt_merged_df['percent_known_cases'] = round(crdt_merged_df.cdc_known_cases / crdt_merged_df.crdt_known_race_cases, 4)

crdt_merged_df_no_ny = crdt_merged_df[crdt_merged_df.state != 'NY']
#PrintSummaryStats(crdt_merged_df_no_ny)

When evaluating the percent of cases that report on race/ethnicity in the CDC dataset, we also need to consider the 2% of overall cases with race/ethnicity that were suppressed due to privacy reasons. We should give states and counties credit for reporting race/ethnicity data for those cases even if we aren't able to use it due to privacy suppression. Below, the map on the top left shows the percent of cases with known race/ethnicity and the map on the top right shows the percent of cases with known or suppressed race/ethnicity. The maps on the bottom show the same information at the county level.
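
As a small worked example of the two percentages, the sketch below splits a hypothetical set of cases into known, suppressed, and unknown race/ethnicity. The function name `race_ethnicity_shares` and the counts are assumptions chosen for illustration only.

```python
def race_ethnicity_shares(known, suppressed, unknown):
    """Return (share known, share known or suppressed) of all cases.

    Suppressed cases count as reported in the second share because the
    state or county did supply a value, even though it cannot be used.
    """
    total = known + suppressed + unknown
    return known / total, (known + suppressed) / total

# Hypothetical counts for one state (not real data).
known_share, known_or_suppressed_share = race_ethnicity_shares(
    known=600, suppressed=50, unknown=350)
print('known: %.2f, known or suppressed: %.2f'
      % (known_share, known_or_suppressed_share))
```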

In [ ]:
#@title

chart_df = county_chart_df.copy(deep=True)
chart_df.reset_index(inplace=True)
chart_df.county_fips = chart_df.county_fips.astype(int)
chart_df['percent_known_cases'] = round(chart_df.total_known_cases / chart_df.total_cases, 2)
chart_df['total_known_or_na_cases'] = chart_df.total_known_cases.fillna(0) + chart_df.na_cases.fillna(0)
chart_df['percent_known_or_na_cases'] = round(chart_df.total_known_or_na_cases / chart_df.total_cases, 2)
In [ ]:
#@title
cdc_known_state_fields_dict = {
    'x': {'name': 'cdc_known_cases', 'format': ',', 'title': 'Known race/ethnicity cases'},
    'y': {'name': 'cdc_cases', 'format': ',', 'title': 'CDC cases'},
    'percent': {'name': 'cdc_known_cases_percent', 'format': '.0%', 'title': 'Percent known cases'},
}

cdc_known_state_title = 'CDC Cases with Known Race/Ethnicity as of %s' % date_display_name
cdc_known_state_map = CreateMap(
    crdt_merged_df, cdc_known_state_fields_dict, cdc_known_state_title, total_cases_scale_max, map_height, map_width, 'state', 'percent'
)

cdc_known_or_na_state_fields_dict = {
    'x': {'name': 'cdc_known_or_na_cases', 'format': ',', 'title': 'Known or suppressed race/ethnicity cases'},
    'y': {'name': 'cdc_cases', 'format': ',', 'title': 'CDC cases'},
    'percent': {'name': 'cdc_known_or_na_cases_percent', 'format': '.0%', 'title': 'Percent known or suppressed cases'},
}
cdc_known_or_na_state_title = 'CDC Cases with Known+Suppressed Race/Ethnicity as of %s' % date_display_name
cdc_known_or_na_state_map = CreateMap(
    crdt_merged_df, cdc_known_or_na_state_fields_dict, cdc_known_or_na_state_title, total_cases_scale_max, map_height, map_width, 'state', 'percent'
)

(cdc_known_state_map | cdc_known_or_na_state_map).configure(
    padding={"left": 0, "top": 5, "right": 0, "bottom": 5}
).configure_view(
    strokeWidth=0,
).configure_legend(
    gradientLength=map_height - 50
).display()
In [ ]:
#@title
cdc_known_county_fields_dict = {
    'x': {'name': 'total_known_cases', 'format': ',', 'title': 'Known race/ethnicity cases'},
    'y': {'name': 'total_cases', 'format': ',', 'title': 'CDC cases'},
    'percent': {'name': 'percent_known_cases', 'format': '.0%', 'title': 'Percent known cases'},
}
cdc_known_county_title = 'CDC Cases with Known Race/Ethnicity as of %s' % date_display_name
cdc_known_county_map = CreateMap(
    chart_df, cdc_known_county_fields_dict, cdc_known_county_title, total_cases_scale_max, map_height, map_width, 'county', 'percent'
)

cdc_known_or_na_county_fields_dict = {
    'x': {'name': 'total_known_or_na_cases', 'format': ',', 'title': 'Known or suppressed race/ethnicity cases'},
    'y': {'name': 'total_cases', 'format': ',', 'title': 'CDC cases'},
    'percent': {'name': 'percent_known_or_na_cases', 'format': '.0%', 'title': 'Percent known or suppressed cases'},
}
cdc_known_or_na_county_title = 'CDC Cases with Known+Suppressed Race/Ethnicity as of %s' % date_display_name
cdc_known_or_na_county_map = CreateMap(
    chart_df, cdc_known_or_na_county_fields_dict, cdc_known_or_na_county_title, total_cases_scale_max, map_height, map_width, 'county', 'percent'
)

(cdc_known_county_map | cdc_known_or_na_county_map).configure(
    padding={"left": 0, "top": 5, "right": 0, "bottom": 5}
).configure_view(
    strokeWidth=0,
).configure_legend(
    gradientLength=map_height - 50
).display()
In [ ]:
#@title
#PrintSummaryStats(crdt_merged_df, field='cdc_known_cases_percent')
#PrintSummaryStats(crdt_merged_df, field='cdc_known_or_na_cases_percent')
#tuple(crdt_merged_df[crdt_merged_df.cdc_known_or_na_cases_percent <= .5].state)

Note: A larger version of the county maps for hovering over smaller counties is available in the Appendix.

We can see that the maps on the right are bluer than those on the left, which means that states and counties are doing a better job at reporting race/ethnicity when we consider the data that were suppressed for privacy reasons. This effect is most pronounced in states like Wyoming, Texas, and Louisiana, which have many counties with small populations or small population subgroups.

We can see the increase in the percent of cases with known race/ethnicity --> known or suppressed race/ethnicity across all states:

  • Average = 59% --> 64%
  • Median = 63% --> 69%
  • Min = 3% --> 13%
  • Max = 89% --> 96%
  • Percent above 85% = 6% --> 14% (3 --> 7 states)
  • Percent above 50% = 76% --> 78% (39 --> 40 states)

However, even if you include the cases with suppressed race/ethnicity, California alone is still missing race/ethnicity data for 2.3M cases. Los Angeles County alone is responsible for 817K of the cases in California missing race/ethnicity data.

States: CDC vs. CRDT

How does the CDC dataset compare to the CRDT dataset, which is the most up-to-date aggregate dataset we have for race/ethnicity at the state level? Overall, 66% of the cases in the CRDT data have race/ethnicity compared to 55% in the CDC data (57% with suppressed data).

We may even be undercounting the 66% of cases with known race/ethnicity in the CRDT data because of the non-standard ways that each state reports on race/ethnicity, as described in this Covid Racial Data Tracker analysis. If a state uses a combined race/ethnicity field, then it's a straightforward comparison to the CDC's combined race/ethnicity field. If a state uses separate fields for race/ethnicity, then we still use the number of people with known race within each state because all of the race categories will also contain Hispanic/Latino people. We could potentially be undercounting the number of people with known race/ethnicity in the CRDT if there are people who have unknown race but known ethnicity. If we adjusted the numbers in those cases, it would make the CRDT percentages look even better in comparison to the CDC data.

In [ ]:
#@title
crdt_known_state_fields_dict = {
    'x': {'name': 'crdt_known_race_cases', 'format': ',', 'title': 'Known race/ethnicity cases'},
    'y': {'name': 'crdt_cases', 'format': ',', 'title': 'CRDT cases'},
    'percent': {'name': 'crdt_known_race_cases_percent', 'format': '.0%', 'title': 'Percent known cases'},
}

crdt_known_state_title = 'CRDT Cases with Known Race/Ethnicity as of %s' % date_display_name
crdt_known_map = CreateMap(
    cdc_crdt_merged_df, crdt_known_state_fields_dict, crdt_known_state_title, total_cases_scale_max, map_height, map_width, 'state', 'percent'
)

(crdt_known_map | cdc_known_state_map).configure(
    padding={"left": 0, "top": 5, "right": 0, "bottom": 5}
).configure_view(
    strokeWidth=0,
).configure_legend(
    gradientLength=map_height - 50
).display()
In [ ]:
#@title
#print('% known in CRDT: ', crdt_merged_df.crdt_known_race_cases.sum() / crdt_merged_df.crdt_cases.sum())
#PrintSummaryStats(cdc_crdt_merged_df, field='crdt_known_race_cases_percent')

The percent of CRDT cases with known race/ethnicity is between 0% and 99% for all states:

  • Average = 74%
  • Median = 77%
  • Min = 0% (New York)
  • Max = 99% (District of Columbia)
  • Percent above 85% = 27% (14 states)
  • Percent above 50% = 96% (49 states)

Overall, the CRDT has a higher percentage of cases with known race/ethnicity than the CDC data at the state level. Although it appears that the CDC has better data for Texas than the CRDT, the maps above don't account for the fact that the CDC data contain only 3% of the Texas cases in the CRDT data. To take that into account, we can compare the number of cases within each state that have known race/ethnicity instead of the percent of cases.
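
To see why a percentage alone can mislead, consider a hypothetical state where the CDC file holds 3% of the aggregate cases. All four counts below are made up for illustration and are not taken from the Texas data.

```python
# Hypothetical counts (assumptions for illustration only).
crdt_total, crdt_known = 1_000_000, 50_000
cdc_total, cdc_known = 30_000, 18_000

# Share of each source's own cases that have known race/ethnicity.
crdt_known_share = crdt_known / crdt_total
cdc_known_share = cdc_known / cdc_total

# Ratio of the counts of cases with known race/ethnicity.
known_count_ratio = cdc_known / crdt_known

print('CRDT share known: %.2f' % crdt_known_share)
print('CDC share known: %.2f' % cdc_known_share)
print('CDC known cases / CRDT known cases: %.2f' % known_count_ratio)
```

The CDC share looks far better than the CRDT share in this example, yet the CDC file still contains fewer cases with known race/ethnicity, which is what the count-based comparison is meant to expose.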

In [ ]:
#@title
fields_dict = {
    'x': {'name': 'crdt_known_race_cases', 'format': ',', 'title': 'CRDT known race/ethnicity cases'},
    'y': {'name': 'cdc_known_cases', 'format': ',', 'title': 'CDC known race/ethnicity cases'},
    'percent': {'name': 'percent_known_cases', 'format': '.2f', 'title': 'Ratio of CDC to CRDT'},
}
title = 'Ratio of CDC to CRDT Cases with Known Race/Ethnicity as of %s' % date_display_name

CreateScatterPlotAndMap(
    crdt_merged_df, fields_dict, title, cases_known_scale_max, scatter_height, scatter_width, map_width - 5, 'state', 'ratio'
).display()
In [ ]:
#@title
#PrintSummaryStats(crdt_merged_df_no_ny, field='percent_known_cases')

Notes:

  • New York is marked as gray in the map because it has 0 cases with known race/ethnicity in CRDT vs. 345K cases in the CDC data.

The ratio of CDC to CRDT cases with known race/ethnicity is between 0.01 and 1.07 for all states excluding New York:

  • Average = 0.60
  • Median = 0.71
  • Min = 0.01 (North Dakota, Louisiana, Wyoming)
  • Max = 1.07 (Massachusetts, New Jersey)
  • Percent between 0.85 and 1.15 = 28% (14 states within +/- 0.15)
  • Percent between 0.50 and 1.50 = 68% (34 states within +/- 0.50)

CRDT has race/ethnicity data for 1.5M more people in California and 764K more people in Florida than the CDC data has.

Overall, the CRDT is a more complete source of race/ethnicity data at the state level than the CDC data in terms of both the counts of cases with race/ethnicity data and the percentage of cases with race/ethnicity data. The only exceptions are New York, which has no cases with race/ethnicity in the CRDT, and possibly Massachusetts and New Jersey, where the CDC data have 1.07 times as many cases with race/ethnicity as the CRDT.

What State and County Data are Usable?

How can states and counties improve their completeness for race/ethnicity data, especially when compared to the more reliable and up-to-date aggregate data that come from public health websites, as collected by the CRDT and NYT?

There are two ways in which states can improve the data they send to the CDC:

  1. Increase the total cases reported to get closer to the aggregate data.
  2. Increase the percentage of cases reported with known race/ethnicity to get closer to 100%.

In the Total Case Counts section above, we identified the states and counties with the biggest discrepancies relative to aggregate data. In the Cases with Race/Ethnicity section, we looked at the percentage of cases within each state and county that have race/ethnicity data.

The charts below show those two components together; the scatterplots show (1) the CDC case counts as a percentage of the CRDT/NYT total case counts on the y-axis, and (2) the percentage of CDC cases with known race/ethnicity on the x-axis. The colors of the dots and on the map show the product of those two numbers, which is the percentage of expected total cases that have race/ethnicity in the CDC dataset.

The scatterplots below can help us diagnose the issues in each state or county:

  • Bottom left quadrant: Low percentage of cases reported, low reporting of race/ethnicity (and/or high suppression).
  • Top left quadrant: Mid-to-high percentage of cases reported, low reporting of race/ethnicity (and/or high suppression).
  • Bottom right quadrant: Low percentage of cases reported, mid-to-high reporting of race/ethnicity (with little suppression).
  • Top right quadrant: Mid-to-high percentage of cases reported, mid-to-high reporting of race/ethnicity (with little suppression).
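
The quadrant reading above can be sketched as a small helper. The 50% cut-off between "low" and "mid-to-high" is our assumption for illustration, not a boundary defined in this report, and the function name is hypothetical.

```python
def diagnose_quadrant(percent_of_reference, percent_known, threshold=0.5):
    """Label a state or county by its position on the scatterplot.

    percent_of_reference: CDC cases / CRDT or NYT cases (y-axis).
    percent_known: share of CDC cases with known race/ethnicity (x-axis).
    threshold: assumed boundary between "low" and "mid-to-high".
    """
    vertical = 'top' if percent_of_reference >= threshold else 'bottom'
    horizontal = 'right' if percent_known >= threshold else 'left'
    return f'{vertical} {horizontal}'


# A place reporting most of its cases but little race/ethnicity data.
label = diagnose_quadrant(percent_of_reference=0.9, percent_known=0.2)
print(label)  # top left
```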
In [ ]:
#@title
nyt_cdc_known_merged_df = chart_df.join(nyt_counties_df, on="county_fips", how='inner', lsuffix='_left', rsuffix='_right')
nyt_cdc_known_merged_df.reset_index(inplace=True)
nyt_cdc_known_merged_df['percent'] = round(nyt_cdc_known_merged_df.total_cases / nyt_cdc_known_merged_df.nyt_cases, 2)
In [ ]:
#@title
crdt_merged_df['percent_max_100'] = crdt_merged_df.percent.clip(upper=1)
crdt_merged_df['percent_reccs'] = crdt_merged_df.percent_max_100 * crdt_merged_df.cdc_known_cases_percent
state_reccs_fields_dict = {
    'y': {'name': 'percent_max_100', 'format': '.0%', 'title': 'CDC percent of CRDT total cases'},
    'x': {'name': 'cdc_known_cases_percent', 'format': '.0%', 'title': 'CDC percent with race/ethnicity'},
    'percent': {'name': 'percent_reccs', 'format': '.0%', 'title': 'Product: CDC percent of CRDT total with race/ethnicity'},
}
state_reccs_title = 'CDC Percent of Total Cases x Race/Ethnicity as of %s' % date_display_name

scatter = CreateScatterPlotAndMap(
    crdt_merged_df, state_reccs_fields_dict, state_reccs_title, 1, scatter_height, scatter_width, map_width, 'state', 'percent'
)
scatter.configure_view(
    strokeWidth=0,
).configure_legend(
    gradientLength=map_height - 50
).configure_mark(
    stroke='grey'
).display()
In [ ]:
#@title
nyt_cdc_known_merged_df['percent_max_100'] = nyt_cdc_known_merged_df.percent.clip(upper=1)
nyt_cdc_known_merged_df['percent_reccs'] = nyt_cdc_known_merged_df.percent_max_100 * nyt_cdc_known_merged_df.percent_known_cases
county_reccs_fields_dict = {
    'y': {'name': 'percent_max_100', 'format': '.0%', 'title': 'CDC percent of NYT total cases'},
    'x': {'name': 'percent_known_cases', 'format': '.0%', 'title': 'CDC percent with race/ethnicity'},
    'percent': {'name': 'percent_reccs', 'format': '.0%', 'title': 'Product: CDC percent of NYT total with race/ethnicity'},
}
county_reccs_title = 'CDC Percent of Total Cases x Race/Ethnicity as of %s' % date_display_name

scatter = CreateScatterPlotAndMap(
    nyt_cdc_known_merged_df, county_reccs_fields_dict, county_reccs_title, 1, scatter_height, scatter_width, map_width, 'county', 'percent'
)
scatter.configure_view(
    strokeWidth=0,
).configure_legend(
    gradientLength=map_height - 50
).configure_mark(
    stroke='grey'
).display()
In [ ]:
#@title
#PrintSummaryStats(crdt_merged_df, field='percent_reccs')
#PrintSummaryStats(nyt_cdc_known_merged_df, field='percent_reccs')
#greater_than_85_df = nyt_cdc_known_merged_df[nyt_cdc_known_merged_df['percent_reccs'] > .85]
#print('total pop > 85%: ', greater_than_85_df.total_pop.sum(), greater_than_85_df.total_pop.sum() / 328239523)
#greater_than_50_df = nyt_cdc_known_merged_df[nyt_cdc_known_merged_df['percent_reccs'] > .50]
#print('total pop > 50%: ', greater_than_50_df.total_pop.sum(), greater_than_50_df.total_pop.sum() / 328239523)

Notes:

  • All states or counties with > 100% of the total cases in the CRDT or NYT data were capped at 100%.
  • A larger version of the county map for hovering over smaller counties is available in the Appendix.

We can get an overall measure of completeness if we look at the number of states in the top right corner of the scatterplot where the composite score is > 85% (where the dots turn dark blue) and > 50% (where the dots turn yellow).

In [ ]:
#@title
# Manually update these fields based on the latest CDC data.
row_names = [
    'Number of states with composite > 85%',
    '(as a percent of all states)',
    'Number of states with composite > 50%',
    '(as a percent of all states)',
    'Number of counties with composite > 85%',
    '(as a percent of all counties)',
    'Number of counties with composite > 50%',
    '(as a percent of all counties)',
    'Population in counties with composite > 85%',
    '(as a % of total U.S population – States + D.C.)',
    'Population in counties with composite > 50%',
    '(as a % of total U.S population – States + D.C.)',
]
crdt_metadata = [
    '14',
    '(27%)',
    '49',
    '(96%)',
    '-',
    '-',
    '-',
    '-',
    '-',
    '-',
    '-',
    '-',]
cdc_metadata = [
    '2',
    '(4%)',
    '25',
    '(49%)',
    '125',
    '(4%)',
    '1,303',
    '(41%)',
    '9M',
    '(3%)',
    '122M',
    '(37%)',
]
table_data = {'CRDT': crdt_metadata, 'CDC': cdc_metadata}
metadata_df = pd.DataFrame(table_data, index=row_names)
metadata_df.head(15)
Out[ ]:
                                                    CRDT     CDC
Number of states with composite > 85%                 14       2
(as a percent of all states)                       (27%)    (4%)
Number of states with composite > 50%                 49      25
(as a percent of all states)                       (96%)   (49%)
Number of counties with composite > 85%                -     125
(as a percent of all counties)                         -    (4%)
Number of counties with composite > 50%                -   1,303
(as a percent of all counties)                         -   (41%)
Population in counties with composite > 85%            -      9M
(as a % of total U.S population – States + D.C.)       -    (3%)
Population in counties with composite > 50%            -    122M
(as a % of total U.S population – States + D.C.)       -   (37%)

If we require that states or counties have 85% of total expected cases with race/ethnicity, that limits us to 4% of states and 4% of counties, where those counties account for 3% of the U.S. population. If we loosen that requirement to 50% of total expected cases with race/ethnicity, that limits us to 49% of states and 41% of counties, where those counties account for 37% of the U.S. population. The CRDT has the best available data overall from state public health websites, but only 27% of states in their data had at least 85% of total expected cases with race/ethnicity.
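
The table above is filled in by hand. As a rough sketch of how the same figures could be derived, the helper below counts rows above a composite threshold and sums their population. The column names `percent_reccs` and `total_pop` follow the dataframes used earlier; the helper name and the toy data are hypothetical, and in the real notebook the population denominator would be the U.S. total (328,239,523 in the commented-out cell above) rather than the toy sum.

```python
import pandas as pd


def summarize_threshold(df, threshold, total_population,
                        score_col='percent_reccs', pop_col='total_pop'):
    """Count rows whose composite score exceeds threshold and sum their population."""
    above = df[df[score_col] > threshold]
    population = int(above[pop_col].sum())
    return {
        'count': len(above),
        'share_of_rows': len(above) / len(df),
        'population': population,
        'share_of_population': population / total_population,
    }


# Toy data only: four made-up counties.
toy_counties = pd.DataFrame({
    'county_fips': ['00001', '00002', '00003', '00004'],
    'percent_reccs': [0.90, 0.60, 0.40, 0.86],
    'total_pop': [1000, 2000, 3000, 4000],
})
toy_total_pop = int(toy_counties['total_pop'].sum())

high = summarize_threshold(toy_counties, 0.85, toy_total_pop)
mid = summarize_threshold(toy_counties, 0.50, toy_total_pop)
print(high)
print(mid)
```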

How to Improve State and County Data

For states and counties to improve the percentage of cases with known race/ethnicity, most need to reduce their number of Unknowns or missing values. A few states, however, seem to have errors where some race/ethnicity groups are almost entirely missing from the dataset:

  • California (24% known or suppressed): 0% of the cases are Hispanic/Latino (10 cases out of 1.9M).
  • North Dakota (13% known or suppressed): The only races reported are Asian and American Indian / Alaska Native.
  • Delaware (13% known or suppressed): White and Black cases are both 0% of total (10 cases each out of 46K).
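
One way to screen for this kind of error is to flag race/ethnicity groups whose share of a state's cases is implausibly close to zero. The sketch below is only a first-pass screen under assumed thresholds (`min_state_cases`, `max_share`) and assumed column names (`race_ethnicity`, `cases`); it cannot tell a reporting error from a genuinely small group, and it will not catch groups that are absent from the data entirely.

```python
import pandas as pd


def flag_near_missing_groups(counts_df, min_state_cases=10_000, max_share=0.001):
    """Return rows where a group's share of state cases is below max_share.

    counts_df needs one row per (res_state, race_ethnicity) with a cases column.
    """
    state_total = counts_df.groupby('res_state')['cases'].transform('sum')
    shares = counts_df.assign(state_total=state_total,
                              share=counts_df['cases'] / state_total)
    is_flagged = (shares['state_total'] >= min_state_cases) & (shares['share'] < max_share)
    return shares.loc[is_flagged, ['res_state', 'race_ethnicity', 'cases', 'share']]


# Toy data only: state 'AA' has a group that is almost entirely missing.
toy_counts = pd.DataFrame({
    'res_state': ['AA', 'AA', 'AA', 'BB', 'BB'],
    'race_ethnicity': ['Hispanic/Latino', 'White', 'Black', 'Hispanic/Latino', 'White'],
    'cases': [10, 500_000, 100_000, 20_000, 30_000],
})
flagged = flag_near_missing_groups(toy_counts)
print(flagged)
```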

Another way to evaluate states with issues is to see whether their percentage of cases with known race/ethnicity is improving over time. We can look at the 11 states on the left side of the states scatterplot above that have < 50% of cases with known or suppressed race/ethnicity.

In [ ]:
#@title
cdc_states_by_month_df = pd.io.gbq.read_gbq(cdc_states_by_month_query, project_id=project_id)
cdc_states_by_month_df.set_index(keys=['res_state', 'date'], inplace=True)

cdc_states_by_month_known_or_na_df = pd.io.gbq.read_gbq(cdc_states_by_month_known_or_na_query, project_id=project_id)
cdc_states_by_month_known_or_na_df.set_index(keys=['res_state', 'date'], inplace=True)

cdc_known_over_time = cdc_states_by_month_df.join(cdc_states_by_month_known_or_na_df, how='left')
cdc_known_over_time['percent_known_or_na'] = round(cdc_known_over_time.known_or_na_cases / cdc_known_over_time.total_cases, 2)
cdc_known_over_time.reset_index(inplace=True)
In [ ]:
#@title
# percent_known_or_na is the share of cases with known or suppressed race/ethnicity,
# so the y-axis title says "known" to match the field and the chart title.
base = alt.Chart(cdc_known_over_time).mark_line(point=True).encode(
    x=alt.X('date', title='CDC earliest report date', axis=alt.Axis(labelAngle=0)),
    y=alt.Y('percent_known_or_na', title='Percent known or suppressed race/ethnicity', axis=alt.Axis(format='%')),
    color=alt.Color('res_state', scale=alt.Scale(scheme='category20'), title='State')
).properties(
    title='States with fewer than 50% of Cumulative Cases with Known or Suppressed Race/Ethnicity',
    height=map_height,
    width=map_width
)
# display() returns None, so call it separately to keep the chart object in base.
base.display()

We can see that a few of these states have improved over time and now have more than 50% of cases with known or suppressed race/ethnicity: Louisiana, New York, and Georgia. A few states started off the year with greater than 50%: Alaska, Connecticut, and Maryland. But none of these states have fully fixed their issues, with the possible exception of Louisiana if it continues on its current trajectory.
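
The same comparison can be made numerically instead of by eye. The sketch below compares each state's first and latest monthly value and checks which states now sit above 50%. It assumes a dataframe shaped like `cdc_known_over_time` (columns `res_state`, `date`, `percent_known_or_na`) with dates that sort chronologically; the helper name and the toy values are hypothetical.

```python
import pandas as pd


def first_vs_last(df, value_col='percent_known_or_na'):
    """Compare each state's earliest and latest monthly value."""
    ordered = df.sort_values(['res_state', 'date'])
    grouped = ordered.groupby('res_state')[value_col]
    summary = pd.DataFrame({'first': grouped.first(), 'last': grouped.last()})
    summary['change'] = summary['last'] - summary['first']
    summary['now_above_50'] = summary['last'] > 0.5
    return summary


# Toy data only: 'AA' improves over the year, 'BB' gets worse.
toy_over_time = pd.DataFrame({
    'res_state': ['AA', 'AA', 'BB', 'BB'],
    'date': ['2020-03', '2020-12', '2020-03', '2020-12'],
    'percent_known_or_na': [0.2, 0.6, 0.7, 0.4],
})
trend = first_vs_last(toy_over_time)
print(trend)
```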

Appendix

Additional CDC data fields

The additional fields in the data, including whether the person died or was hospitalized, are all known for fewer than 50% of cases.

In [ ]:
#@title
field_list = ['death_yn', 'hosp_yn', 'icu_yn', 'onset_dt', 'pos_spec_dt', 'hc_work_yn',
              'pna_yn', 'abxchest_yn', 'acuterespdistress_yn', 'mechvent_yn', 'fever_yn', 'sfever_yn', 'chills_yn', 'myalgia_yn', 'runnose_yn',
              'sthroat_yn', 'cough_yn', 'sob_yn', 'nauseavomit_yn', 'headache_yn', 'abdom_yn', 'diarrhea_yn', 'medcond_yn']
FieldAnalysis(project_id, cdc_table, field_list).display()

The CDC also commented on these fields in their case data FAQs:

Because of the volume of cases, most health departments are unable to conduct investigations of every case to obtain additional information. Because of this, most case reports are missing data on patient demographics, symptoms, underlying health conditions, characteristics of hospitalizations such as ventilator use, and other factors such as recent travel history.

The case report form contains many more fields, but unfortunately, the fields get less complete as you go down the form. Citizens for Responsibility and Ethics in Washington (CREW) obtained a version of this data via FOIA that contained 101 fields with data up to Aug 25, 2020 and shared it with MSM/SHLI. Several of the additional fields from that dataset are shown below; the field with the most known values is whether the case was associated with an outbreak, but even that is only known for 30% of cases.
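
For readers without access to the BigQuery tables, the kind of completeness figure quoted here can be approximated in pandas. This is a stand-in sketch and not the `FieldAnalysis` helper used in this notebook; the set of placeholder labels treated as "not known" is our assumption, and the toy rows are made up.

```python
import pandas as pd

# Assumed placeholder labels that do not count as a known value.
UNKNOWN_VALUES = ['Missing', 'Unknown', 'NA', '']


def percent_known_by_field(df, fields, unknown_values=UNKNOWN_VALUES):
    """Share of rows with a known value for each field, most complete first."""
    known = {}
    for field in fields:
        col = df[field]
        # A value is known if it is neither null nor a placeholder label.
        is_known = col.notna() & ~col.isin(unknown_values)
        known[field] = is_known.mean()
    return pd.Series(known).sort_values(ascending=False)


# Toy data only.
toy_cases = pd.DataFrame({
    'death_yn': ['Yes', 'No', 'Missing', 'Unknown'],
    'hosp_yn': ['Yes', None, 'Missing', 'Unknown'],
})
completeness = percent_known_by_field(toy_cases, ['death_yn', 'hosp_yn'])
print(completeness)
```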

In [ ]:
#@title
field_list = ['death_week', 'icu_length', 'hosp_length', 'translator_yn', 'housing', 'exp_work_critical', 'outbreak_associated',
              'rigors_yn', 'taste_yn', 'fatigue_yn', 'wheezing_yn', 'diffbreathing_yn', 'chestpain_yn', 'test_pcr', 'test_serologic',
              'exp_adultfacility', 'exp_airport', 'exp_animal', 'exp_community', 'exp_gathering', 'exp_contact', 'exp_correctional',
              'exp_ship', 'exp_house', 'exp_other', 'exp_school', 'exp_othcountry', 'exp_unk', 'exp_work']
# Use a separate variable so the global project_id set at the top is not overwritten.
crew_project_id = 'msm-internal-data'
FieldAnalysis(crew_project_id, crew_table, field_list).display()

Large county maps

To make it easier to hover over small counties, here are larger versions of the county maps that appeared in this report.

In [ ]:
#@title
cdc_nyt_map = CreateMap(
    nyt_merged_df, cdc_nyt_fields_dict, cdc_nyt_title, total_cases_scale_max, map_height * 2, map_width * 2, 'county', 'ratio'
).configure_view(
    strokeWidth=0,
).configure_legend(
    gradientLength=map_height - 50
)
cdc_nyt_map.display()
In [ ]:
#@title
cdc_known_county_map = CreateMap(
    chart_df, cdc_known_county_fields_dict, cdc_known_county_title, total_cases_scale_max, map_height * 2, map_width * 2, 'county', 'percent'
).configure_view(
    strokeWidth=0,
).configure_legend(
    gradientLength=map_height - 50
)

cdc_known_county_map.display()
In [ ]:
#@title
cdc_known_or_na_county_map = CreateMap(
    chart_df, cdc_known_or_na_county_fields_dict, cdc_known_or_na_county_title, total_cases_scale_max, map_height * 2, map_width * 2, 'county', 'percent'
).configure_view(
    strokeWidth=0,
).configure_legend(
    gradientLength=map_height - 50
)
cdc_known_or_na_county_map.display()
In [ ]:
#@title
county_completeness = CreateMap(
    nyt_cdc_known_merged_df, county_reccs_fields_dict, county_reccs_title, 1, map_height * 2, map_width * 2, 'county', 'percent'
)
county_completeness.configure_view(
    strokeWidth=0,
).configure_legend(
    gradientLength=map_height - 50
).configure_mark(
    stroke='grey'
).display()

Geographic notes

The CDC Case Surveillance dataset includes a county_fips_code field with a unique identifier for each county. However, because of data quality issues, we ended up using our own lookup from state and county names to county FIPS codes. When we used the county_fips_code field provided in the CDC dataset, 21.5K records with known state and county values had no county_fips_code, including all cases in D.C. (and Long Beach, CA, Pasadena, CA, and Alameda, CA in the December version of the dataset). We created a lookup using American Community Survey (ACS) 2019 5-year estimates data and then modified the lookup to handle misspellings and other issues in the CDC dataset. We documented the changes to the ACS mapping and included the new mapping in this spreadsheet.

With the new mapping, we now match all but 1.1K cases with known state and county values to county FIPS codes. We also identified 60 non-existent state-county combinations listed here that the CDC file was matching to county_fips_codes for 621 cases. We no longer match them to any county_fips_codes, but we do report them in the state-level data matching that state.
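
As an illustration of the approach, the sketch below builds a lookup keyed on state and normalized county name, then layers manual overrides on top for spellings that do not match. The function names, the normalization rules, and the misspelled override are hypothetical; the actual mapping is the one documented in the spreadsheet.

```python
def normalize(name):
    """Lowercase, drop periods, and collapse whitespace in a county name."""
    return ' '.join(name.lower().replace('.', '').split())


def build_fips_lookup(acs_rows, overrides=None):
    """Map (state, normalized county name) to a county FIPS code.

    acs_rows: iterable of (state, county_name, fips) tuples.
    overrides: optional {(state, county_name): fips} for known bad spellings.
    """
    lookup = {(state, normalize(county)): fips for state, county, fips in acs_rows}
    for (state, county), fips in (overrides or {}).items():
        lookup[(state, normalize(county))] = fips
    return lookup


def match_fips(lookup, state, county):
    """Return the FIPS code, or None when the state/county pair is unmatched."""
    return lookup.get((state, normalize(county)))


acs_rows = [
    ('DC', 'District of Columbia', '11001'),
    ('CA', 'Los Angeles', '06037'),
]
# Hypothetical misspelling, for illustration only.
overrides = {('CA', 'Los Angles'): '06037'}
lookup = build_fips_lookup(acs_rows, overrides)

print(match_fips(lookup, 'CA', 'LOS ANGELES'))
print(match_fips(lookup, 'CA', 'Los Angles'))
print(match_fips(lookup, 'CA', 'Nowhere'))
```

Returning `None` for unmatched pairs keeps those cases out of the county-level analysis while leaving them available for the state-level totals, which is how the unmatched cases are treated in this report.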

One geographical exception worth noting is that two areas in Alaska reported their numbers in a combined way:

  • Hoonah-angoon And Yakutat Combined: 54 cases
  • Bristol Bay And Lake And Peninsula Combined: 106 cases

We excluded these cases from the county-level analysis above.

When comparing to the NYT county-level data, some counties are excluded from the comparison because the NYT handles New York City as one geographic unit with no associated county FIPS code instead of five separate counties.
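
The effect of that exclusion can be seen in a small sketch. The five FIPS codes are the New York City boroughs (Bronx, Kings, New York, Queens, Richmond); the dataframes and case counts are toy values, and the variable names are ours rather than the notebook's.

```python
import pandas as pd

# County FIPS codes for the five New York City boroughs.
NYC_COUNTY_FIPS = ['36005', '36047', '36061', '36081', '36085']

# Toy data only: CDC rows are per county, NYT reports NYC as one row without a FIPS code.
toy_cdc = pd.DataFrame({
    'county_fips': NYC_COUNTY_FIPS + ['36001'],
    'total_cases': [100, 200, 150, 180, 50, 40],
})
toy_nyt = pd.DataFrame({
    'county': ['New York City', 'Albany'],
    'county_fips': [None, '36001'],
    'nyt_cases': [700, 45],
})

# An inner join on county_fips keeps only counties present in both sources.
nyt_with_fips = toy_nyt.dropna(subset=['county_fips'])
toy_merged = toy_cdc.merge(nyt_with_fips, on='county_fips', how='inner')

# The CDC counties that fall out of the comparison are the five boroughs.
excluded = toy_cdc[~toy_cdc['county_fips'].isin(toy_merged['county_fips'])]
print(sorted(excluded['county_fips']))
```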

Data Citations and Disclaimers

  • CDC data full citation: Centers for Disease Control and Prevention, COVID-19 Response. COVID-19 Case Surveillance Data Access, Summary, and Limitations (version date: January 31, 2021).
  • Per the CDC data agreement: The CDC does not take responsibility for the scientific validity or accuracy of methodology, results, statistical analyses, or conclusions presented.
  • Population data: U.S. Census Bureau's American Community Survey 2019 5-year estimates accessed via API; e.g., sample query.
  • Covid Racial Data Tracker data: Available in a public spreadsheet.
  • New York Times data: Available as a public CSV file or via Google Cloud Public Datasets.

Contact information

Please email us at shli-covid-data-analysis@googlegroups.com with questions or comments.

In [ ]:
#%%shell
#jupyter nbconvert --to html 'cdc_case_data.ipynb'